System and method for scalable cost-sensitive learning

ABSTRACT

A method (and structure) for processing an inductive learning model for a dataset of examples, includes dividing the dataset into N subsets of data and developing an estimated learning model for the dataset by developing a learning model for a first subset of the N subsets.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention generally relates to a technique of inductive learning. More specifically, an inductive model is built both “accurately” and “efficiently” by dividing a database of examples into N disjoint subsets of data, and a learning model (base classifier), including a prediction of accuracy, is sequentially developed for each subset and integrated into an evolving aggregate (ensemble) learning model for the entire database. The aggregate model is incrementally updated by each completed subset model. The prediction of accuracy provides a quantitative measure upon which to judge the benefit of continuing processing for remaining subsets in the database or to terminate at an intermediate stage.

2. Description of the Related Art

Modeling is a technique to learn a model from a set of given examples of the form {(x₁, y₁), (x₂, y₂), . . . , (x_(n), y_(n))}. Each example (x_(i), y_(i)) consists of a feature vector x_(i) and a class label y_(i). The values in the feature vector could be either discrete, such as someone's marital status, or continuous, such as someone's age and income. The label y_(i) is taken from a discrete set of class labels such as {donor, non-donor} or {fraud, non-fraud}.

The learning task is to learn a model y=f(x) that predicts the class label of an example for which the feature vector is known but the true class label is not.

Inductive learning has a wide range of applications that include, for example, fraud detection, intrusion detection, charity donation, security and exchange, loan approval, animation, and car design, among many others.

The present invention teaches a new framework of scalable cost-sensitive learning. An exemplary scenario for discussing the techniques of the present invention is a charity donation dataset from which a subset of the data is to be chosen as individuals to whom to send campaign letters. Assuming that the cost of a campaign letter is $0.68, it should be apparent that it would be beneficial to send a letter only if the solicited person will donate at least $0.68.

That is, a learning model for this scenario must be taught how to choose individuals from a database containing information for individuals to be targeted for letters. Because there is a cost associated with the letters, and each individual will either donate a different amount of money or not donate at all, this model is cost-sensitive. The overall accuracy, or benefit, is the total amount of donated charity minus the total overhead to send solicitation letters.
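By way of illustration only, the total benefit described above might be computed as in the following minimal sketch. The function name `campaign_benefit` and the sample donation amounts are hypothetical; only the $0.68 letter cost comes from the scenario.

```python
LETTER_COST = 0.68  # cost of one campaign letter, taken from the scenario above


def campaign_benefit(donations, mailed):
    """Total donations from mailed individuals minus the total mailing cost.

    donations: actual donation amount per individual (0.0 for a non-donor)
    mailed:    True where a letter was sent to that individual
    """
    total = 0.0
    for amount, sent in zip(donations, mailed):
        if sent:
            # A mailed individual contributes the donation less the letter cost;
            # an unmailed individual contributes nothing.
            total += amount - LETTER_COST
    return total


# Hypothetical data: four individuals, three of whom receive a letter.
donations = [5.00, 0.00, 0.50, 20.00]
mailed = [True, True, False, True]
print(round(campaign_benefit(donations, mailed), 2))
```

In this sketch the second individual costs $0.68 without donating, which is the loss a cost-sensitive model tries to avoid.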

A second scenario is fraud detection, such as credit card fraud detection. Challenging and investigating fraud is not free. There is an intrinsic cost associated with each fraud case investigation. Assuming that challenging a potential fraud costs $90, it is worthwhile for a credit card company to take action only if the “expected loss” of a fraud (when the same instance is sampled repeatedly) is more than $90.

As should be apparent, there is also a second cost associated with the development of the model that is related to the cost of the computer time and resources necessary to develop a model over a database, particularly in scenarios where the database contains a large amount of data.

Currently, a number of learning algorithms are conventionally used for modeling expected investment strategies in such scenarios as the campaign letter scenario, for example, decision tree learner C4.5®, rule builder RIPPER®, and the naïve Bayes learner.

In a database, each data entry is described by a series of feature values. For the charity donation example, each entry might describe a particular individual's income level, location lived, location worked, education background, gender, family status, past donation history, and perhaps other features.

The aforementioned C4.5® decision algorithm constructs a decision tree model from a dataset or a set of examples of the above form. A decision tree is a DAG (or Directed Acyclic Graph) with a single root. To build a decision tree, the learner first picks the most distinguishing feature from the set of features.

For example, the most distinguishing feature might be someone's income level. Then, the examples in the dataset will be “sorted” by their corresponding value of the chosen feature. For example, individuals with lower income will be sorted through a different path than individuals with higher income. This process is repeated until either there are no more features to use or the examples in a node all belong to one single category, such as donor or non-donor.

RIPPER® is another way to build inductive models. The model is a set of IF-THEN rules. The naïve Bayes method uses the Bayesian Rule to build models.

Using these conventional methods, a user can experiment with different algorithms, parameters, and feature selections and, thereby, evaluate one or more models to be ultimately used for the intended application, such as selecting the individuals to whom campaign letters will be sent.

A problem recognized by the present inventors is that, in current learning model methods, the entire database must be evaluated before the effects of the hypothetical parameters for the test model are known. Depending upon the size of the database, each such test scenario will require much computer time (sometimes many hours or even days) and cost, and it can become prohibitive to spend so much effort in the development of an optimal model to perform the intended task.

Hence, there is currently no method that efficiently models the cost-benefit tradeoff short of taking the time and computer resources to analyze the entire database and predict the accuracy of the model whose parameters are undergoing evaluation.

SUMMARY OF THE INVENTION

In view of the foregoing exemplary problems, drawbacks, and disadvantages of the conventional methods, an exemplary feature of the present invention is to provide a structure and method for an inductive learning technique that significantly increases the accuracy of the basic inductive learning model.

It is another exemplary feature of the present invention to provide a technique in which throughput is increased by at least ten to twenty times the throughput of the basic inductive learning model.

To achieve the above exemplary features and others, in a first exemplary aspect of the present invention, described herein is a method (and structure) of processing an inductive learning model for a dataset of examples, including dividing the dataset into N subsets of data and developing an estimated learning model for the dataset by developing a learning model for a first of the N subsets.

In a second exemplary aspect of the present invention, also described herein is a system to process an inductive learning model for a dataset of example data, including one or more of: a memory containing one or more of N segments of the example data, wherein each segment of example data comprises data for calculating a base classifier for an ensemble model of the dataset; a base classifier calculator for developing a learning model for data in one of the N subsets; an ensemble calculator for progressively developing an ensemble model of the database of examples by successively integrating a base classifier from successive ones of the N segments; a memory interface to retrieve data from the database and to store data as the inductive learning model is progressively developed; and a graphic user interface to allow a user to at least one of enter parameters, to control the progressive development of the ensemble model, and to at least one of display and printout results of the progressive development.

In a third exemplary aspect of the present invention, also described herein is a method of providing a service, including at least one of: providing a database of example data to be used to process an inductive learning model for the example data, wherein the inductive learning model is to be derived by dividing the example data into N segments and using at least one of the N segments of example data to derive a base classifier model; receiving the database of example data and executing the above-described method of deriving the inductive learning model; providing an inductive learning model as derived in the above-described manner; executing an application of an inductive learning model as derived in the above-described manner; and receiving a result of the executing the application.

In a fourth exemplary aspect of the present invention, also described herein is a method of deploying computing infrastructure, including integrating computer-readable code into a computing system, wherein the code in combination with the computing system is capable of processing an inductive learning model for a dataset of examples by dividing the dataset into N subsets of data and developing an estimated learning model for the dataset by developing a learning model for a first of the N subsets.

In a fifth exemplary aspect of the present invention, also described herein is a signal-bearing medium tangibly embodying a program of machine-readable instructions executable by a digital processing apparatus to perform the above-described method of processing an inductive learning model for a dataset of examples.

In a sixth exemplary aspect of the present invention, also described herein is a method of at least one of increasing a speed of development of a learning model for a dataset of examples and increasing an accuracy of the learning model, including dividing the dataset into N subsets of data and developing an estimated learning model for the dataset by developing a learning model for a first subset of the N subsets.

In a seventh exemplary aspect of the present invention, also described herein is a method of developing a predictive model, including, for a dataset comprising a plurality of elements, each element comprising a feature vector, the dataset further comprising a true class label for at least a portion of the plurality of elements, the true class labels allowing the dataset to be characterized as having a plurality of classes, dividing at least a part of the portion of the plurality of elements having the true class label into N segments of elements, and learning a model for elements in at least one of the N segments, as an estimate for a model for all of the dataset.

With the above and other exemplary aspects, the present invention provides a method to improve learning model development by increasing accuracy of the ensemble, by decreasing time to develop a sufficiently accurate ensemble, and by providing quantitative measures by which a user (e.g., one developing the model or implementing an application based on the model) can decide when to terminate the model development because the ensemble is predicted as being sufficiently accurate.

BRIEF DESCRIPTION OF THE DRAWINGS

The foregoing and other exemplary features, aspects and advantages will be better understood from the following detailed description of an exemplary embodiment of the invention with reference to the drawings, in which:

FIG. 1 provides a flowchart 100 of one exemplary method that demonstrates an overview of concepts of the present invention;

FIG. 2 provides an exemplary display 200 of a snapshot of an interactive scenario in which both accuracy and remaining training time are estimated and displayed;

FIG. 3 shows an exemplary benefit matrix 300 for the charity donation scenario;

FIG. 4 shows how the normal density curve 400 can be used to estimate accuracy;

FIG. 5A shows a cost-sensitive decision plot 500 for a single classifier example;

FIG. 5B shows a cost-sensitive decision plot 501 for an example of averaged probability of multiple classifiers;

FIG. 6A shows a plot 600 of accuracy for a credit card dataset, as a function of a number of partitions;

FIG. 6B shows a plot 601 for total benefits for a credit card dataset, as a function of a number of partitions;

FIG. 6C shows a plot 602 for total benefits for a donation dataset, as a function of a number of partitions;

FIG. 7A shows plots 700 of current benefits and estimated final benefits when sampling size k increases up to K=256 for the donation dataset;

FIG. 7B shows plots 701 of current benefits and estimated final benefits when sampling size k increases up to K=256 for the credit card dataset;

FIG. 7C shows plots 702 of current benefits and estimated final benefits when sampling size k increases up to K=256 for the adult dataset;

FIG. 8A shows plots 800 of current benefits and estimated final estimates when sampling size k increases up to K=1024 for the donation dataset;

FIG. 8B shows plots 801 of current benefits and estimated final estimates when sampling size k increases up to K=1024 for the credit card dataset;

FIG. 8C shows plots 802 of current benefits and estimated final estimates when sampling size k increases up to K=1024 for the adult dataset;

FIG. 9 shows a plot 900 of remaining training time for the credit card dataset with K=256;

FIG. 10A shows a plot 1000 of serial improvement for the donation dataset when early stopping is used;

FIG. 10B shows a plot 1001 of serial improvement for the credit card dataset when early stopping is used;

FIG. 10C shows a plot 1002 of serial improvement for the adult dataset when early stopping is used;

FIG. 11A shows a plot 1100 of the decision threshold and probability output (true positives) by the single model for the credit card dataset;

FIG. 11B shows a plot 1101 of the decision threshold and probability output (true positives) by the 256-ensemble model for the credit card dataset;

FIG. 11C shows a plot 1102 of the decision threshold and probability output (false positives) by the single model for the credit card dataset;

FIG. 11D shows a plot 1103 of the decision threshold and probability output (false positives) by the 256-ensemble model for the credit card dataset;

FIG. 12 illustrates an exemplary hardware/information handling system 1200 for incorporating the present invention therein;

FIG. 13 illustrates a signal bearing medium 1300 (e.g., storage medium) for storing steps of a program of a method according to the present invention; and

FIG. 14 illustrates exemplary software modules in a computer program 1400 for executing the present invention.

DETAILED DESCRIPTION OF EXEMPLARY EMBODIMENTS OF THE INVENTION

Referring now to the drawings, and more particularly to FIGS. 1-14, exemplary embodiments for a new framework of scalable cost-sensitive learning are now presented. The illustrative scenario of a charity donation database, from which is to be selected a subset of individuals to whom to send campaign letters, will continue to be used for teaching the concepts of the present invention.

As an introduction, disclosed herein is a method and structure for learning a model using ensembles of classifiers. First, the original, potentially large dataset is partitioned into multiple subsets. Base classifiers are learned from these data subsets, one by one, sequentially. The accuracy of the current ensemble comprised of models computed at any point in the processing is reported to the user.

At the same time, the overall accuracy of the final ensemble comprised of every single model computed from every data subset is statistically estimated and also reported to the end user. These estimates include a lower bound and an upper bound, along with a confidence interval.

Remaining training time is also statistically estimated and reported to the end user. Based on the estimated accuracy and remaining training time, the end user can decide whether it is worthwhile to continue the learning process or, instead, be contented with the current results and stop the processing of the entire dataset.

The discussion below also discloses a graphic user interface (GUI) to implement the inventive process in practice, as well as providing the statistical theorems to prove the soundness of the inventive approach.

FIG. 1 shows an exemplary flowchart 100 of the technique of the present invention. In step 101, a relevant database is partitioned first into a training set and a validation set and then partitioned into a number N of segments or subsets. That is, continuing with the charity donation example, it is assumed that the database contains data on at least one previous campaign effort and includes relevant attributes, such as age, location, income, job description, etc., for a number of individuals from that earlier campaign.

Depending upon the size of the original database, the data can be divided into a number N of segments by any appropriate method, including a simple random technique. Since the present invention uses statistical modeling, it should be apparent that the size of each segment can be determined by techniques known in the art to incorporate a statistically meaningful number of individuals. It should also be apparent that the number N of segments will depend upon the number of entries in the original database and the number of individuals required to make each segment statistically meaningful.

It should also be apparent to one of ordinary skill in the art, after reading the present application, that the method of selecting the number N is not particularly significant to the present invention, and that N can be selected in any number of ways. As examples, one of ordinary skill in the art would readily recognize that the selection of N could be manually entered via a graphical user interface (GUI), as one input parameter provided by the user during the initial parameter inputs for the model development process, or N might be automatically determined by a software module that first evaluates the size of the database and then automatically determines a number N of database segments, as based on such factors as statistical constraints and the size of the database.

In step 102, a model, hereinafter also referred to as a “base classifier”, for each segment is sequentially trained. In the exemplary embodiment, each base classifier becomes an incremental input into the final model, hereinafter also referred to as the “ensemble”, for the overall database data. That is, the base classifiers incrementally are integrated to form the ensemble model.

In step 103, the evolving ensemble model is displayed, as it progressively develops.

In step 104, the user can optionally continue the process for the next increment (e.g., the base classifier for the next subset of the N subsets of data). Although this flowchart shows termination as optional only upon completion of each segment base classifier, it would be readily recognized by one of ordinary skill in the art, after reading the present application, that such termination could actually occur at any time during the processing.

When the processing is stopped in step 104, either prematurely by the user or because all segments have been modeled, the user can then decide, in step 106, whether the intended application should be executed in step 107 in order to, for example, display or print out the names of individuals from a database to whom letters are to be sent for the campaign, or even print out the letters and envelopes for these selected individuals.

In the terminology of the present invention, each of the subsets contains data to train a “classifier”. The classifier is a model trained from the data. A “base classifier” is a classifier trained from each subset.

As can be seen by the discussion above, a key aspect of the present invention, in which subsets are each modeled to incrementally form a composite model, is that the composite modeling can be easily stopped at any early or intermediate stage.
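The sequence of steps 102 through 104 might be sketched as the following loop. This is an illustrative sketch only, not the disclosed implementation: the callables `train`, `evaluate`, and `should_stop` are hypothetical placeholders for the base learner, the validation-set scoring, and the user's (or an automatic) stopping decision.

```python
def progressive_learning(subsets, train, evaluate, should_stop):
    """Train one base classifier per subset, reporting after each increment.

    subsets:     iterable of data subsets (step 101 is assumed already done)
    train:       subset -> base classifier                      (step 102)
    evaluate:    list of base classifiers -> current accuracy   (step 103)
    should_stop: (number trained, current accuracy) -> bool     (step 104)
    """
    ensemble = []
    history = []
    for subset in subsets:
        ensemble.append(train(subset))       # add the next base classifier
        score = evaluate(ensemble)           # accuracy of the current ensemble
        history.append(score)
        if should_stop(len(ensemble), score):
            break                            # early termination
    return ensemble, history


# Toy usage: a "classifier" is just the subset mean, and the "accuracy"
# is the average of those means. Both are stand-ins for illustration.
ensemble, history = progressive_learning(
    subsets=[[1, 2], [3, 4], [5, 6], [7, 8]],
    train=lambda s: sum(s) / len(s),
    evaluate=lambda e: sum(e) / len(e),
    should_stop=lambda k, score: k >= 2,
)
print(len(ensemble), history)
```

Passing the stopping rule as a callable keeps the loop independent of whether termination comes from a user interface or from a preset threshold.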

Thus, considering the above example in more detail, in a database containing, for example, 1,000,000 individuals, there might be exemplarily 100 subsets, each including 10,000 individuals. Depending upon modeling complexity, current methods for developing a complete model for the entire 100 subsets might take, for example, several hours or even days of computer time.

In contrast, using the present invention, based on results of the initial subset models, the user is able to determine whether the time and expense of continuing to develop a complete model would be cost effective or whether to stop the processing and enter a new set of model parameters to re-evaluate a new strategy for the learning model development.

It should be apparent that the user might continue entering new sets of parameters for evaluation, until a set of model parameters is finally determined as being satisfactory. Moreover, using the present invention, the user will also be able to see a quantitative prediction for the results of each current set of parameters.

In more detail, as soon as learning starts, the technique of the present invention begins to compute intermediate models, and, exemplarily, also to report current accuracy and estimated final accuracy, on a holdout validation set, and estimated remaining training time. For a cost-sensitive problem, accuracy is measured in benefits such as dollar amounts.

The term “accuracy” is meant herein to interchangeably mean traditional percentage accuracy (that measures the percentage of examples being classified correctly) and benefits (in terms of dollar amount, such as the total amount of donated charity minus the cost of mailing, in the charity donation example).

FIG. 2 shows an exemplary snapshot of the learning process in accordance with the present invention, using a graphic user interface (GUI) display 200 in an interactive scenario where both accuracy and remaining training time are estimated.

The exemplary GUI display in FIG. 2 indicates that the accuracy 203, 203 on the holdout validation set (total donated charity, minus the cost of mailing to both donors and non-donors) 201 for the algorithm using the current intermediate model is $12,840.50. In this exemplary snapshot, the accuracy 202, 203 of the complete model on the holdout validation set, when learning completes, is estimated to be $14,289.50±100.3 with at least 99.7% confidence 204. The additional training time 205, 206 to generate the complete model is estimated to be 5.40±0.70 minutes with at least 99.7% confidence.

Currently, as displayed in the lower indicator 207, approximately 35% of the database contents have been processed up through the snapshot shown in FIG. 2. The information on the display 200 continuously refreshes whenever a new intermediate model is produced, until either the user explicitly terminates the learning process (e.g., using the “STOP” command input 208 in FIG. 2) or the complete model is generated for all segments S_(j).

In the scenario above, the user may stop the learning process at any time, exemplarily for at least any one of the following reasons:

-   i) the intermediate model has enough accuracy;
-   ii) the intermediate model's accuracy is not significantly different from that of the complete model;
-   iii) the estimated accuracy of the complete model is too low; or,
-   iv) the training time is unexpectedly long.

More specifically, for the example snapshot shown in FIG. 2, the user probably would want to continue the modeling, since it is worthwhile to spend approximately six more minutes to receive at least approximately $1,400 more in donations (e.g., $14,289.50−$12,840.50), given a 99.7% confidence.
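The arithmetic behind this judgment can be checked with the short sketch below, which uses only the figures quoted for the FIG. 2 snapshot. The variable names are illustrative.

```python
current_benefit = 12840.50        # benefit of the current intermediate model
estimated_final = 14289.50        # estimated benefit of the complete model
half_width = 100.3                # half-width of the 99.7% confidence range

# Expected additional benefit from finishing the training run.
expected_gain = estimated_final - current_benefit

# Gain at the lower and upper ends of the confidence range.
gain_low = (estimated_final - half_width) - current_benefit
gain_high = (estimated_final + half_width) - current_benefit

print(f"expected gain: ${expected_gain:.2f}")
print(f"range: ${gain_low:.2f} to ${gain_high:.2f}")
```

The expected gain is $1,449.00, and even the lower end of the range remains above $1,300 for roughly six more minutes of training.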

One of ordinary skill in the art would also readily recognize, after having read this application, that processing could be automatically terminated if accuracy or training time exceeds a predetermined or manually-entered threshold.

In this example, progressive modeling is applied to cost-sensitive learning. For cost-insensitive learning, the algorithm reports traditional accuracy in place of dollar amounts. “Cost-sensitive” means that each example carries a different benefit, such that different individuals may donate different amounts of money or not donate at all. In contrast, “cost-insensitive” means that each example is equally important.

The overall accuracy is the total amount of rewards one would get by predicting correctly. Obviously, for a cost-sensitive application, one should concentrate on those individuals with a lot of donation capacity.

As will be explained later in more detail, this framework of scalable cost-sensitive learning is significantly more useful than a batch mode learning process, especially for a very large dataset. Moreover, with the technique of the present invention, the user can easily experiment with different algorithms, parameters, and feature selections without waiting for a long time for a result ultimately determined as being unsatisfactory.

Therefore, the present invention is capable of generating a relatively small number of base classifiers to estimate the performance of the entire ensemble when all base classifiers are produced.

Without a loss of generality for discussing the underlying theory of the technique of the present invention, it is assumed that a training set S is partitioned into K disjoint subsets S_(j), and that each subset is equal in size. As to the sequence in processing the subsets, if it is assumed that the distribution of the dataset is uniform, each subset can be taken sequentially. Otherwise, the dataset can either be completely “shuffled”, or random sampling without replacement can be used, to draw S_(j) (e.g., select one of the subsets to be processed next).
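One way the partitioning into K disjoint, equal-size subsets might be realized is sketched below. It is an assumption-laden illustration: the function name `partition`, the fixed seed, and the choice to drop leftover examples when the dataset size is not divisible by K are not taken from the disclosure.

```python
import random


def partition(examples, K, seed=0):
    """Shuffle the examples and split them into K disjoint, equal-size subsets.

    Shuffling first means the subsets can then be taken sequentially.
    Any remainder (len(examples) % K items) is dropped for simplicity.
    """
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)   # seeded for reproducibility
    size = len(shuffled) // K
    return [shuffled[j * size:(j + 1) * size] for j in range(K)]


# Toy usage: 1,000 record identifiers split into K = 10 subsets of 100.
subsets = partition(range(1000), K=10)
print(len(subsets), len(subsets[0]))
```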

A base level model C_(j) is then trained from S_(j). If there is no additional data, S_(j) can be used for both training and validation. Otherwise, S_(j) is used for training and a completely separate holdout set apart from S (e.g., a superset of S_(j)) is used for validation.

Given an example x from a validation set S_(v) (it can be a different dataset or the training set), model C_(j) outputs probabilities for all possible class labels that x may be an instance of, i.e., p_(j)(l_(i)|x) for class label l_(i). Classes l_(i) are structures in the dataset, such as “donor”, “non-donor”, “fraud”, and “non-fraud”. Details on how to calculate p_(j)(l_(i)|x) are found below. In addition, a benefit matrix b[l_(i), l_(j)] records the benefit received by predicting an example of class l_(i) to be an instance of class l_(j).

An exemplary benefit matrix 300 for the charitable donation, in which the cost of sending a letter is assumed to be $0.68, is shown in FIG. 3. It can be seen that there are two possible predictions 301: either an individual “will donate” or the individual “will not donate”. There are also two possible actual outcomes 302: either the individual does “donate” or the individual “does not donate”.

The benefit matrix provides the benefit for each possible prediction/outcome:

-   the benefit 303 if the individual is predicted to donate and does donate would be Y(x)−$0.68;
-   the benefit 304 if the individual is predicted to donate but does not donate would be −$0.68; and
-   the benefit 305, 306 if the individual is predicted to “not donate” is zero, since no letter would be sent to that individual.
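The benefit matrix of FIG. 3 might be expressed as a small function, as in the following sketch. The function name `benefit`, the label strings, and the argument `donation_amount` (standing in for Y(x)) are illustrative choices.

```python
LETTER_COST = 0.68  # cost of sending one letter, as assumed for FIG. 3


def benefit(actual, predicted, donation_amount=0.0):
    """Benefit b[actual, predicted] for the charity donation scenario.

    actual, predicted: "donor" or "non-donor"
    donation_amount:   Y(x), the amount donated by this individual
    """
    if predicted == "non-donor":
        return 0.0                            # no letter sent, nothing gained or lost
    if actual == "donor":
        return donation_amount - LETTER_COST  # letter sent and a donation received
    return -LETTER_COST                       # letter sent, no donation


print(benefit("donor", "donor", donation_amount=10.0))
print(benefit("non-donor", "donor"))
print(benefit("donor", "non-donor", donation_amount=10.0))
```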

In contrast, for cost-insensitive (or accuracy-based) problems, ∀i, b[l_(i), l_(i)]=1 and ∀i≠j, b[l_(i), l_(j)]=0. Since traditional accuracy-based decision making is a special case of the cost-sensitive problem, only the algorithm in the context of cost-sensitive decision making is discussed herein. Using the benefit matrix b[ . . . ], each model C_(j) will generate an expected benefit or risk e_(j)(l_(i)|x) for every possible class l_(i):

$$\text{Expected Benefit:}\quad e_j(l_i \mid x) = \sum_{l_{i'}} b[l_{i'}, l_i] \cdot p_j(l_{i'} \mid x) \qquad (1)$$

It is now assumed that k models {C₁, . . . , C_(k)}, k≤K, have been trained. Combining the individual expected benefits gives:

$$\text{Average Expected Benefit:}\quad E_k(l_i \mid x) = \frac{\sum_{j=1}^{k} e_j(l_i \mid x)}{k} \qquad (2)$$

The optimal decision policy can now be used to choose the class label with the maximal expected benefit:

$$\text{Optimal Decision:}\quad L_k(x) = \arg\max_{l_i} E_k(l_i \mid x) \qquad (3)$$
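Eqns. (1) to (3) might be sketched in code as follows. The dictionary layout `b[true_label][predicted_label]`, the function names, and the numeric values (a $10.00 donation and two hypothetical base models) are assumptions made for illustration.

```python
def expected_benefit(b, probs):
    """Eqn. (1): e_j(l_i|x) = sum over true labels l' of b[l'][l_i] * p_j(l'|x)."""
    labels = list(probs)
    return {li: sum(b[lt][li] * probs[lt] for lt in labels) for li in labels}


def average_expected_benefit(per_model):
    """Eqn. (2): average of the k per-model expected benefits, label by label."""
    k = len(per_model)
    return {li: sum(e[li] for e in per_model) / k for li in per_model[0]}


def optimal_decision(avg):
    """Eqn. (3): the label with the maximal average expected benefit."""
    return max(avg, key=avg.get)


# Benefit matrix for a hypothetical individual who would donate $10.00.
b = {
    "donor": {"donor": 10.00 - 0.68, "non-donor": 0.0},
    "non-donor": {"donor": -0.68, "non-donor": 0.0},
}
# Class probabilities output by two hypothetical base models for one example x.
model_probs = [
    {"donor": 0.10, "non-donor": 0.90},
    {"donor": 0.05, "non-donor": 0.95},
]
avg = average_expected_benefit([expected_benefit(b, p) for p in model_probs])
print(avg, optimal_decision(avg))
```

Here the second model alone would decline to mail (its expected benefit for "donor" is negative), while the averaged ensemble still favors mailing.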

Assuming that l(x) is the true label of x, the accuracy of the ensemble with k classifiers is:

$$A_k = \sum_{x \in S_v} b[l(x), L_k(x)] \qquad (4)$$

For accuracy-based problems, A_(k) is usually normalized into a percentage using the size of the validation set |S_(v)|. For cost-sensitive problems, it is customary to use some units to measure benefits such as dollar amounts. Besides accuracy, there is also the total time to train C₁ to C_(k):

$$T_k = \text{the total time to train } \{C_1, \ldots, C_k\} \qquad (5)$$
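As a small illustration of Eqn. (4), the sketch below sums the benefit of each prediction over a validation set and, for the accuracy-based case, normalizes by the set size. The function name, the label strings, and the sample labels are hypothetical.

```python
def ensemble_accuracy(true_labels, predicted_labels, b):
    """Eqn. (4): sum of b[l(x)][L_k(x)] over the validation examples."""
    return sum(b[t][p] for t, p in zip(true_labels, predicted_labels))


# Accuracy-based (cost-insensitive) benefit matrix: 1 on the diagonal, 0 elsewhere.
labels = ["fraud", "non-fraud"]
b_accuracy = {t: {p: 1 if t == p else 0 for p in labels} for t in labels}

true_labels = ["fraud", "non-fraud", "non-fraud", "fraud"]
predicted = ["fraud", "non-fraud", "fraud", "fraud"]

A_k = ensemble_accuracy(true_labels, predicted, b_accuracy)
print(A_k, A_k / len(true_labels))  # raw count and normalized accuracy
```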

Next, based on the performance of k≤K base classifiers, statistical techniques are used to estimate both the accuracy and training time of the ensemble with K models.

However, first, some notation is summarized. A_(K), T_(K) and M_(K) are the true values to estimate. Respectively, they are the accuracy of the complete ensemble, the training time of the complete ensemble, and the remaining training time after k classifiers. Their estimates are denoted in lower case, i.e., a_(K), t_(K) and m_(K).

An estimate is a range with a mean and standard deviation. The mean of a symbol is represented by an overbar and the standard deviation is represented by a sigma (σ). Additionally, σ_(d) is the standard error, or the standard deviation of a sample mean.

Estimating Accuracy

The accuracy estimate is based on the probability that l_(i) is the label predicted by the ensemble of K classifiers for example x:

$$P\{L_K(x) = l_i\} \qquad (6)$$

Since each class label l_(i) has a probability of being the predicted class, and predicting an instance of class l(x) as l_(i) receives a benefit b[l(x), l_(i)], the expected accuracy received for x by predicting with K base models is:

$$\overline{a}(x) = \sum_{l_i} b[l(x), l_i] \cdot P\{L_K(x) = l_i\} \qquad (7)$$

with standard deviation σ(a(x)). To calculate the expected accuracy on the validation set S_(v), the expected accuracy on each example x is summed:

$$\overline{a}_K = \sum_{x \in S_v} \overline{a}(x) \qquad (8)$$

Since each example is independent, according to the multinomial form of the central limit theorem (CLT), the total benefit of the complete model with K models follows a normal distribution with the mean value of Eqn. [8] and a standard deviation of:

$$\sigma(a_K) = \sqrt{\sum_{x \in S_v} \sigma(a(x))^2} \qquad (9)$$

Using confidence intervals, the accuracy of the complete ensemble A_(K) falls within the following range:

$$\text{With confidence } p,\quad A_K \in \overline{a}_K \pm t \cdot \sigma(a_K) \qquad (10)$$

When t=3, the confidence p is approximately 99.7%.
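Eqns. (7) to (10) might be combined as in the sketch below. One assumption should be noted: the text gives no explicit formula for σ(a(x)), so the sketch takes it to be the standard deviation of the per-example benefit under the label distribution P{L_K(x)=l_i}. The function names and input values are hypothetical.

```python
import math


def example_mean_and_variance(benefits, probs):
    """Mean (Eqn. 7) and variance of the benefit for one example.

    benefits[i] is b[l(x), l_i]; probs[i] is P{L_K(x) = l_i}.
    The variance definition is an assumption, as noted in the lead-in.
    """
    mean = sum(b * p for b, p in zip(benefits, probs))
    var = sum(p * (b - mean) ** 2 for b, p in zip(benefits, probs))
    return mean, var


def accuracy_interval(per_example, t=3.0):
    """Eqns. (8)-(10): (low, mean, high) for the accuracy of the full ensemble."""
    total_mean = 0.0
    total_var = 0.0
    for benefits, probs in per_example:
        m, v = example_mean_and_variance(benefits, probs)
        total_mean += m          # Eqn. (8): sum of per-example means
        total_var += v           # Eqn. (9): variances add for independent examples
    sd = math.sqrt(total_var)
    return total_mean - t * sd, total_mean, total_mean + t * sd


# Two hypothetical validation examples: (benefit per label, probability per label).
per_example = [
    ([9.32, 0.0], [0.8, 0.2]),    # a true donor of $10.00
    ([-0.68, 0.0], [0.5, 0.5]),   # a true non-donor
]
low, mean, high = accuracy_interval(per_example)  # t = 3, about 99.7% confidence
print(round(low, 2), round(mean, 2), round(high, 2))
```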

Next is discussed the process of deriving P{L_(K)(x)=l_(i)}. If E_(K)(l_(i)|x) is known, there is only one label, L_(K)(x), whose P{L_(K)(x)=l_(i)} will be 1, and all other labels will have probability equal to 0. However, if E_(K)(l_(i)|x) is not known, only its estimate E_(k)(l_(i)|x), measured from k classifiers, can be used to derive P{L_(K)(x)=l_(i)}.

From random sampling theory, E_(k)(l_(i)|x) is an unbiased estimate of E_(K)(l_(i)|x) with standard error of:

$$\sigma_d(E_k(l_i \mid x)) = \frac{\sigma(E_k(l_i \mid x))}{\sqrt{k}} \cdot \sqrt{1 - f}, \quad \text{where } f = k/K \qquad (11)$$
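The standard error of Eqn. (11), including the finite population correction √(1−f), might be computed as sketched below. It is assumed here that σ(E_k) is estimated by the sample standard deviation of the k per-model expected benefits; the function name and input values are hypothetical.

```python
import math
import statistics


def standard_error(per_model_benefits, K):
    """Eqn. (11): sigma / sqrt(k) * sqrt(1 - f), with f = k / K.

    per_model_benefits: e_j(l_i|x) from each of the k trained base models
    K:                  total number of base models in the complete ensemble
    """
    k = len(per_model_benefits)
    f = k / K                                     # sampling fraction
    sigma = statistics.stdev(per_model_benefits)  # sample standard deviation (assumption)
    return sigma / math.sqrt(k) * math.sqrt(1 - f)


values = [0.32, -0.18, 0.10, 0.04]     # hypothetical e_j values from k = 4 models
print(standard_error(values, K=256))   # few models trained: large uncertainty
print(standard_error(values, K=4))     # all models trained: no uncertainty left
```

The correction term drives the standard error to zero as k approaches K, which matches the intuition that nothing remains to be estimated once every base model has been trained.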

According to the central limit theorem, the true value E_(K)(l_(i)|x)falls within a normal distribution with mean value of μ=E_(k)(l_(i)|x)and standard deviation of σ=σ_(d) (E_(k)(l_(i)|x)). If E_(k)(l_(i)|x) ishigh, it is more likely for E_(K)(l_(i)|x) to be high, and consequently,for P{L_(k)(x)=l_(i)} to be high.

For the time being, the correlation among different class labels can beignored, and naïve probability P′{L_(K)(x)=l_(i)} can be computed.Assuming that r_(t) is an approximate of max l_(i) (E_(K)(l_(i)|x)), thearea 401 in the range of [r_(t), +∞] is the probabilityP′{L_(K)(x)=l_(i)}, as exemplarily shown in FIG. 4: $\begin{matrix}{{P^{\prime}\{ {{L_{K}(x)} = l_{i}} \}} = {\int_{r_{1}}^{+ \infty}{\frac{1}{\sqrt{2\pi}\sigma}{\exp\lbrack {{- \frac{1}{2}}( \frac{z - u}{\sigma} )^{2}} \rbrack}{\mathbb{d}z}}}} & (12)\end{matrix}$where σ=σ_(d)(E_(K)(l_(i)|x)) and μ=E_(K)(l_(i)|x).

When k≦30, to compensate for the error in standard error estimation, the Student-t distribution with df=k can be used. The average of the two largest E_(k)(l_(i)|x)'s is used to approximate max_(l_(i)) (E_(K)(l_(i)|x)).

The reason not to use the maximum itself is that if the associated label is not the predicted label of the complete model, the probability estimate for the true predicted label may be too low.

On the other hand, P{L_(K)(x)=l_(i)} is inversely related to the probabilities for other class labels to be the predicted label. When it is more likely for other class labels to be the predicted label, it will be less likely for l_(i) to be the predicted label. A common method to take correlation into account is to use normalization:

$\begin{matrix}{P\{L_K(x) = l_i\} = \frac{P'\{L_K(x) = l_i\}}{\sum\limits_j P'\{L_K(x) = l_j\}}} & (13)\end{matrix}$

Thus, P{L_(K)(x)=l_(i)} has been derived, in order to estimate the accuracy in Eqn. [7].
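By way of illustration only, the derivation of Eqns. [11] through [13] might be sketched as follows. The function names are hypothetical, and the sketch uses the normal distribution throughout; the Student-t variant suggested above for k≦30 is not implemented.

```python
import math

def standard_error(sample_std, k, K):
    """Eqn. (11): standard error with finite population correction f = k/K."""
    return sample_std / math.sqrt(k) * math.sqrt(1.0 - k / K)

def normal_tail(r, mu, sigma):
    """Area of a normal density N(mu, sigma^2) over [r, +inf), as in Eqn. (12)."""
    if sigma == 0.0:
        # Degenerate case (e.g. k == K): all probability mass sits at mu.
        return 1.0 if mu >= r else 0.0
    return 0.5 * math.erfc((r - mu) / (sigma * math.sqrt(2.0)))

def label_probabilities(means, stds, k, K):
    """Normalized P{L_K(x) = l_i} for every label l_i (Eqn. (13)).

    means[i] is E_k(l_i|x) and stds[i] is the sample standard deviation of
    that estimate across the k trained base classifiers.
    """
    top_two = sorted(means, reverse=True)[:2]
    r_t = sum(top_two) / len(top_two)  # average of the two largest estimates
    naive = [normal_tail(r_t, m, standard_error(s, k, K))
             for m, s in zip(means, stds)]
    total = sum(naive)  # at least 0.5, since the top label has mu >= r_t
    return [p / total for p in naive]

probs = label_probabilities([0.7, 0.3], [0.1, 0.1], k=10, K=100)
```

The resulting list can be used directly as the label probabilities in Eqn. [7].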

Estimating Training Time

It is assumed that the training times for the sampled k models are τ_(1) to τ_(k). Their mean and standard deviation are $\overline{\tau}$ and σ(τ). Then the total training time of K classifiers is estimated as follows: with confidence p, T_(K) ∈ $\overline{t}_K$ ± t·σ(t_(K)), where $\overline{t}_K = K \cdot \overline{\tau}$ and

$\begin{matrix}{\sigma(t_K) = \frac{t \cdot K \cdot \sigma(\tau)}{\sqrt{k}} \cdot \sqrt{1 - f}.} & (14)\end{matrix}$
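By way of illustration only, the total training time estimate might be computed as sketched below. The function name is hypothetical. Two assumptions are made: σ(τ) is taken to be the sample standard deviation, and the factor t, which appears both in the interval and in Eqn. [14] as printed, is applied only once when forming the half-width of the interval.

```python
import math
import statistics

def total_training_time(times, K, t=3.0):
    """Estimate the total training time of K classifiers from k observed times.

    Returns (mean estimate, half-width of the confidence interval).
    """
    k = len(times)
    mean_tau = statistics.mean(times)
    # Assumption: sample standard deviation; a single sample has no spread.
    std_tau = statistics.stdev(times) if k > 1 else 0.0
    f = k / K                          # sampling fraction
    t_bar = K * mean_tau               # estimated total training time
    half_width = t * K * std_tau / math.sqrt(k) * math.sqrt(1.0 - f)
    return t_bar, half_width

# Three base classifiers trained so far, out of K = 12 (times in seconds).
estimate, half_width = total_training_time([10.0, 12.0, 14.0], K=12)
```

For the three hypothetical timings shown, the estimate is 144 seconds with a half-width of 36 seconds.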

To find the remaining training time M_(K), the time already spent, k·$\overline{\tau}$, is simply deducted from the estimated total of Eqn. [14]: with confidence p, M_(K) ∈ $\overline{m}_K$ ± t·σ(m_(K)), where $\overline{m}_K = \overline{t}_K - k \cdot \overline{\tau}$ and

$\begin{matrix}{\sigma(m_K) = \sigma(t_K)} & (15)\end{matrix}$

Putting It Together

In comparing FIG. 1 with the basic algorithm shown below, details of an exemplary embodiment of the present invention should now be apparent. In the first step, the first random sample from the database is requested and the first model C₁ is trained. Then, the second random sample is requested and the second model C₂ is trained.

From this point on, in this exemplary embodiment, the user will be updated with estimated accuracy, remaining training time, and confidence levels. The accuracy of the current model (A_(k)), the estimated accuracy of the complete model ($\overline{\alpha}_K$), as well as the estimated remaining training time ($\overline{m}_K$) are all available. From these statistics, the user decides to continue or terminate. Typically, the user would terminate learning if one of the following stopping criteria is met:

-   The accuracy of the current model is sufficiently high. That is, assuming that θ_(A) is the target accuracy, this criterion becomes: A_(k) ≧ θ_(A).
-   The accuracy of the current model is sufficiently close to that of the complete model. That is, there would not be significant improvement by training the model to the end. More precisely, and using the terminology above, t·σ(α_(K)) ≦ ε.
-   The estimated accuracy of the final model is too low to be useful. More formally, if ($\overline{\alpha}_K$ + t·σ(α_(K))) << θ_(A), then stop the learning process.
-   The estimated training time is too long, and the user decides to abort. More formally, assuming that θ_(T) is the target training time, then, if ($\overline{m}_K$ − t·σ(m_(K))) >> θ_(T), the learning process should be canceled.
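By way of illustration only, the four stopping criteria listed above might be checked as sketched below. The class, field, and function names are hypothetical, and the "much less than" and "much greater than" comparisons of the third and fourth criteria are simplified to strict inequalities, which is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Progress:
    current_accuracy: float    # A_k, accuracy of the current k-model ensemble
    est_final_accuracy: float  # mean estimate of A_K
    est_final_std: float       # sigma(alpha_K)
    est_remaining_time: float  # mean estimate of M_K
    est_remaining_std: float   # sigma(m_K)

def stop_reason(p: Progress, target_accuracy: float, epsilon: float,
                target_time: float, t: float = 3.0) -> Optional[str]:
    """Return the first stopping criterion that is met, or None to continue."""
    if p.current_accuracy >= target_accuracy:
        return "current model is accurate enough"
    if t * p.est_final_std <= epsilon:
        return "current model is close to the complete model"
    # Even the optimistic upper bound misses the target accuracy.
    if p.est_final_accuracy + t * p.est_final_std < target_accuracy:
        return "final model would be too inaccurate"
    # Even the optimistic lower bound exceeds the time budget.
    if p.est_remaining_time - t * p.est_remaining_std > target_time:
        return "remaining training time is too long"
    return None

status = stop_reason(Progress(0.90, 0.92, 0.01, 100.0, 5.0),
                     target_accuracy=0.85, epsilon=0.001, target_time=1000.0)
```

Returning a reason rather than a bare flag lets a user interface report why learning was terminated.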

As a summary of all the important steps of progressive modeling, an exemplary algorithm, described in code summary format, is outlined below as Algorithm 1:

Algorithm 1: (Progressive Modeling Based on Averaging Ensemble)
Data   : benefit matrix b[ ], training set S, validation set S_(v), and K
Result : k ≦ K classifiers
begin
  partition S into K disjoint subsets of equal size {S_1, ..., S_K};
  train C_1 from S_1 and τ_1 is the training time;
  k ← 2;
  while k ≦ K do
    train C_k from S_k and τ_k is the training time;
    for x ∈ S_(v) do
      calculate P{L_K(x) = l_i} (Eqn. [13]);
      calculate the expected accuracy of x and its standard deviation (Eqn. [7]);
    end
    estimate accuracy (Eqn. [8], Eqn. [9]) and remaining training time (Eqn. [15]);
    if the estimated accuracy and remaining training time satisfy the stopping criteria then
      return C_1, ..., C_k;
    end
    k ← k + 1;
  end
  return C_1, ..., C_K;
end

Efficiency

Computing K base models sequentially has complexity of $K \cdot O\left(f\left(\frac{N}{K}\right)\right)$. Both the average and standard deviation can be incrementally updated linearly in the number of examples.

Desiderata

The obvious advantage of the above averaging ensemble is its scalability and the fact that its accuracy and training time can be estimated. The accuracy is also potentially higher than that of a single model trained in batch mode from the entire dataset.

That is, the base models trained from disjoint data subsets make uncorrelated noisy errors when estimating expected benefits. It is known and has been studied that uncorrelated errors are reduced by averaging. The averaged expected benefits may still differ from those of the single classifier, but this may not make a difference to the final prediction, as long as the label predicted by the single model remains the label with the maximum expected benefit.

The multiple model is very likely to have higher benefits because of its “smoothing effect” and stronger bias towards predicting expensive examples correctly. It is noted that the only interest is that of well-defined cost-sensitive problems (as opposed to ill-defined problems) where ∀x, b[l(x), l(x)] ≧ b[l(x), l_(j)].

In other words, correct prediction is always better than misclassification. For well-defined problems, E(l(x),x) is monotonic in p(l(x)|x). In order to make correct predictions, p(l(x)|x) has to be bigger than a threshold T(x), which is inversely proportional to b[l(x), l(x)].

As an example, for the charity donation dataset, $T(x) = \frac{\$0.68}{y(x)}$, where y(x) is the donation amount and $0.68 is the cost to send a campaign letter. To explain the “smoothing effect”, the cost-sensitive decision plot is used.

For each data point x, its decision threshold T(x) and probability estimate p(l(x)|x) are plotted in the same figure. The sequence of examples on the x-axis is ordered increasingly by their T(x) values.

FIGS. 5A and 5B illustrate two exemplary plots. FIG. 5A is conjectured for a single classifier, while FIG. 5B is conjectured for the averaged probability of multiple classifiers. All data points above the T(x) line are predicted correctly.

Using these plots, the smoothing effect is now explained. Since probability estimates by multiple classifiers are uncorrelated, it is very unlikely for all of them to be close to either 1 or 0 (the extremities), and their resultant average will likely spread more “evenly” between 1 and 0. This is visually illustrated in these two figures by comparing the plot 501 in FIG. 5B to the plot 500 in FIG. 5A.

The smoothing effect favors predicting expensive examples correctly. Thresholds T(x) of expensive examples are low. These examples are in the left portion of the decision plots. If the probability p(l(x)|x) estimated by a single classifier is close to 0, it is very likely for the averaged probability p′(l(x)|x) to be bigger than p(l(x)|x) and, consequently, bigger than T(x) of expensive examples, so that they are predicted to be positive. The two expensive data points 502, 503 in the bottom left corner of the decision plots are misclassified by the single classifier.

However, they are correctly predicted by the multiple model (labels 504, 505). Due to the smoothing effect, averaging of multiple probabilities biases more towards expensive examples than the single classifier does. This is a desirable property since expensive examples contribute greatly towards total benefit. Cheaper examples have higher T(x), and they are shown in the right portion of both plots in FIGS. 5A and 5B.

If the single-classifier estimate p(l(x)|x) for a cheap example is close to 1, it is more likely for the averaged probability p′(l(x)|x) to be lower than p(l(x)|x), and consequently lower than T(x), so that the example is misclassified. However, cheap examples carry much less benefit than expensive examples. The bias towards expensive examples by the multiple model 501 still has potentially higher total benefits than the single model 500.
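By way of illustration only, the decision threshold T(x) for the donation example and the effect of averaging might be sketched as follows. The function names and the probability values are hypothetical and are chosen only to show an expensive example that a single classifier misses but an averaged ensemble predicts as positive.

```python
def threshold(donation_amount, mailing_cost=0.68):
    """Decision threshold T(x) = mailing cost / y(x) for the donation example."""
    return mailing_cost / donation_amount

def predict_positive(probabilities, donation_amount):
    """Average the base-model probabilities and compare the result with T(x)."""
    averaged = sum(probabilities) / len(probabilities)
    return averaged > threshold(donation_amount)

# An "expensive" example: a $20 donor has a low threshold of 0.034.
single = predict_positive([0.01], donation_amount=20.0)
# Averaging with two further (hypothetical) estimates lifts the probability
# to 0.05, which is above the threshold.
ensemble = predict_positive([0.01, 0.06, 0.08], donation_amount=20.0)
```

In this sketch the single estimate of 0.01 falls below T(x), while the averaged estimate of 0.05 exceeds it.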

Calculating Probabilities

The calculation of p(l_(i)|x) is straightforward. For decision trees, such as C4.5®, and supposing that n is the total number of examples and n_(i) is the number of examples with class l_(i) in a leaf, then

$\begin{matrix}{p(\ell_i | x) = \frac{n_i}{n}.} & (16)\end{matrix}$

For cost-sensitive problems, in order to avoid skewed probability estimates at the leaf of a tree, curtailed probabilities or curtailment can be computed as has been proposed (e.g., see B. Zadrozny and C. Elkan, “Obtaining calibrated probability estimates from decision trees and naïve bayesian classifiers”, Proceedings of Eighteenth International Conference on Machine Learning (ICML'2001), 2001.)

The search down the tree is stopped if the current node has fewer than v examples, and the probabilities are computed as in Eqn. [16]. The probabilities for decision rules, e.g. RIPPER®, are calculated in a similar way as for decision trees.

For a naive Bayes classifier, assuming that α_(j)'s are the attributes of x, p(l_(i)) is the prior probability or frequency of class l_(i) in the training data, and p(α_(j)|l_(i)) is the conditional probability of observing attribute value α_(j) given class label l_(i), then the score n(l_(i)|x) for class label l_(i) is:

$\begin{matrix}{n(\ell_i | x) = p(\ell_i) \prod\limits_j p(\alpha_j | \ell_i),} & (17)\end{matrix}$

and the probability is calculated on the basis of n(l_(i)|x) as:

$\begin{matrix}{p(\ell_i | x) = \frac{n(\ell_i | x)}{\sum\limits_{i'} n(\ell_{i'} | x)}} & (18)\end{matrix}$
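By way of illustration only, the naive Bayes score and its normalization might be computed as sketched below. The function name, the label names, and the probability values are hypothetical; the sketch also assumes that at least one class receives a non-zero score, since otherwise the normalization would divide by zero.

```python
import math

def naive_bayes_posteriors(priors, likelihoods):
    """Return p(l_i|x) for every class label.

    priors[label] is p(l_i); likelihoods[label] lists p(a_j|l_i) for the
    attribute values a_j observed in x.
    """
    # Score of each label: prior times the product of attribute likelihoods.
    scores = {label: priors[label] * math.prod(likelihoods[label])
              for label in priors}
    total = sum(scores.values())  # assumed to be non-zero
    # Normalize the scores so that they sum to one.
    return {label: score / total for label, score in scores.items()}

posteriors = naive_bayes_posteriors(
    priors={"donor": 0.05, "non-donor": 0.95},
    likelihoods={"donor": [0.6, 0.5], "non-donor": [0.2, 0.1]},
)
```

For the hypothetical values shown, the scores are 0.015 and 0.019, giving posteriors of about 0.44 and 0.56.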

The above probability estimate is known to be skewed. For cost-sensitive problems, it has been proposed to divide the score n(l_(i)|x) into multiple bins and compute the probability p(l_(i)|x) from each bin.

Experiment

In this experiment, there are two main issues: the accuracy of the ensemble and the precision of the estimation. The accuracy and training time of a single model computed from the entire dataset are regarded as the baseline.

To study the precision of the estimation methods, the upper and lower error bounds of an estimated value are compared to its true value. In this discussion, three datasets have been carefully selected. They are from real-world applications and significant in size. Each dataset is used both as a traditional problem that maximizes traditional accuracy and as a cost-sensitive problem that maximizes total benefits. As cost-sensitive problems, the selected datasets differ in how the benefit matrices are obtained.

Datasets

The first dataset is the donation dataset that first appeared in the KDDCUP'98 competition. It is supposed that the cost of requesting a charitable donation from an individual x is $0.68, and the best estimate of the amount that x will donate is Y(x). Its benefit matrix is shown in FIG. 3.

As a cost-sensitive problem, the total benefit is the total amount of received charity minus the cost of mailing. The data has already been divided into a training set and a test set. The training set includes 95,412 records for which it is known whether or not the person made a donation and how much the donation was. The test set contains 96,367 records for which similar donation information was not published until after the KDD'98 competition.

The standard training/test set splits were used to compare with previous results. The feature subsets were based on the KDD'98 winning submission. To estimate the donation amount, the multiple linear regression method was used. To avoid overestimation, only those contributions between $0 and $50 were used.

The second dataset is a credit card fraud detection problem. Assuming that there is an overhead of $90 to dispute and investigate a fraud and y(x) is the transaction amount, the following is the benefit matrix:

                    Predict fraud    Predict not fraud
Actual fraud        y(x) − $90       0
Actual not fraud    −$90             0
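By way of illustration only, the benefit matrix above might be used to decide whether a transaction is worth investigating, as sketched below. The function names are hypothetical, and the decision rule of investigating whenever the expected benefit is positive is an assumption of the sketch.

```python
OVERHEAD = 90.0  # cost in dollars to dispute and investigate one transaction

def benefit(actual_fraud, predict_fraud, amount):
    """Entry of the credit card benefit matrix for a transaction amount y(x)."""
    if not predict_fraud:
        return 0.0
    return amount - OVERHEAD if actual_fraud else -OVERHEAD

def should_investigate(p_fraud, amount):
    """Predict fraud when the expected benefit of investigating is positive."""
    expected = (p_fraud * benefit(True, True, amount)
                + (1.0 - p_fraud) * benefit(False, True, amount))
    # The expected benefit of predicting "not fraud" is always 0.
    return expected > 0.0

# With a 10% fraud probability, a $1,000 transaction is worth investigating
# (expected benefit $10), while a $500 transaction is not (expected loss $40).
large = should_investigate(0.10, 1000.0)
small = should_investigate(0.10, 500.0)
```

Algebraically, the expected benefit reduces to p·y(x) − $90, so a transaction is investigated only when p·y(x) exceeds the overhead.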

As a cost-sensitive problem, the total benefit is the sum of recovered frauds minus investigation costs. The dataset was sampled from a one-year period and contains a total of 5M transaction records. The features record the time of the transaction, merchant type, merchant location, and past payment and transaction history summary. Data of the last month was used as test data (40,038 examples) and data of previous months as training data (406,009 examples).

The third dataset is the adult dataset from the UCI repository. It is a widely used dataset to compare different algorithms on traditional accuracy. For cost-sensitive studies, a benefit of $2 is artificially associated with class label F and a benefit of $1 with class label N, as summarized below:

            Predict F    Predict N
Actual F    $2           0
Actual N    0            $1

The natural split of training and test sets is used, so the results can be easily duplicated. The training set contains 32,561 entries and the test set contains 16,281 records.

Experimental Setup

Three learning algorithms were selected: decision tree learner C4.5®, rule builder RIPPER®, and a naïve Bayes learner. A wide range of partitions, K ∈ {8, 16, 32, 64, 128, 256}, was chosen. The accuracy and estimated accuracy are measured on the test dataset.

Accuracy

Since the capability of the new framework is studied for both traditional accuracy-based problems and cost-sensitive problems, each dataset is treated as both a traditional and a cost-sensitive problem. The baseline traditional accuracy and total benefits of the batch-mode single model are shown in Table 1 below, in the accuracy column for the traditional accuracy-based problem and in the benefit column for the cost-sensitive problem, respectively.

TABLE 1
                 Accuracy Based    Cost-sensitive
                 accuracy          benefit
for C4.5®:
  Donation       94.94%            $13,292.7
  Credit Card    87.77%            $733,980
  Adult          84.38%            $16,443
for RIPPER®:
  Donation       94.94%            $0
  Credit Card    90.14%            $712,541
  Adult          84.84%            $19,725
for NB:
  Donation       94.94%            $13,928
  Credit Card    85.46%            $704,285
  Adult          82.86%            $16,269

These results are the baseline that the multiple model should achieve. It is noted that different parameters for RIPPER® on the donation dataset were experimented with. However, the most specific rule set produced by RIPPER® contains only one rule that covers six donors and one default rule that always predicts donate. This succinct rule will not find any donor and will not receive any donations. However, RIPPER® performs reasonably well for the credit card and adult datasets.

For the multiple model, the results are first discussed when the complete multiple model is fully constructed. Then, the results of the partial multiple model are presented. Each result is the average of different multiple models with K ranging from 2 to 256. In Table 2 below, the results are shown in two columns under accuracy and benefit.

TABLE 2
                 Accuracy Based     Cost-sensitive
                 accuracy           benefit
for C4.5®:
  Donation       94.94 +/− 0%       $14,702.9 +/− 458
  Credit Card    90.37 +/− 0.5%     $804,964 +/− 32,250
  Adult          85.6 +/− 0.6%      $16,435 +/− 150
for RIPPER®:
  Donation       94.94 +/− 0%       $0 +/− 0
  Credit Card    91.46 +/− 0.6%     $815,612 +/− 34,730
  Adult          86.1 +/− 0.4%      $19,875 +/− 390
for NB:
  Donation       94.94 +/− 0%       $14,282 +/− 530
  Credit Card    88.64 +/− 0.3%     $798,943 +/− 23,557
  Adult          84.94 +/− 0.3%     $16,169 +/− 60

As the respective results in Tables 1 and 2 are compared, the multiple model consistently and significantly beat the accuracy of the single model for all three datasets, using all three different inductive learners. The most significant increase in both accuracy and total benefits is for the credit card dataset. The total benefits have been increased by approximately $7,000 to $10,000; the accuracy has been increased by approximately 1% to 3%. For the KDDCUP'98 donation dataset, the total benefit has been increased by $1400 for C4.5® and $250 for NB.

Next, the trends of accuracy are studied when the number of partitions K increases. In FIGS. 6A, 6B, and 6C, the accuracy and total benefits 600, 601, 602 for the credit card dataset and the total benefits for the donation dataset are plotted against an increasing number of partitions K. The base learner for this study was C4.5®.

It can be clearly seen that for the credit card dataset, the multiple model consistently and significantly improves both the accuracy and total benefits over the single model by at least 1% in accuracy and $40,000 in total benefits for all choices of K. For the donation dataset, the multiple model boosts the total benefits by at least $1400. Nonetheless, when K increases, both the accuracy and total benefits show a slow decreasing trend. It would be expected that when K is extremely large, the results will eventually fall below the baseline.

Accuracy Estimation

The current and estimated final accuracy are continuously updated and reported to the user. The user can terminate the learning based on these statistics.

As a summary, these include the accuracy of the current model A_(k), the true accuracy of the complete model A_(K), and the estimate of the true accuracy $\overline{\alpha}_K$ with σ(α_(K)).

If the true value falls within the error range of the estimate with high confidence and the error range is small, the estimate is good. More formally, with confidence p, A_(K) ∈ $\overline{\alpha}_K$ ± t·σ(α_(K)). Quantitatively, it can be said that an estimate is good if the error bound (t·σ) is within 5% of the mean and the confidence is at least 99%.
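By way of illustration only, the quantitative criterion for a good estimate might be checked as sketched below. The function name is hypothetical. The sketch takes the error bound t·σ as an input and assumes that t has already been chosen so that the confidence is at least 99%, for example t=3.

```python
def is_good_estimate(mean, error_bound, max_relative_error=0.05):
    """Check whether the error bound t*sigma is within 5% of the estimated mean.

    Assumption: error_bound was computed with a t that gives at least 99%
    confidence (for example t = 3, i.e. approximately 99.7%).
    """
    return error_bound <= max_relative_error * abs(mean)

# Hypothetical estimates: a bound of 612 on a mean of 14,913 is about 4.1%
# of the mean, whereas a bound of 10 on a mean of 100 is 10% of the mean.
tight = is_good_estimate(14913.0, 612.0)
loose = is_good_estimate(100.0, 10.0)
```

Only the first of the two hypothetical estimates meets the 5% criterion.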

If k is assumed to be chosen such that k=20%·K, then Table 3 below shows the average estimated accuracy of multiple models with different numbers of partitions K, where K ∈ {8, 16, 32, 64, 128, 256}. The true values A_(K) all fall within the error range. The sampling size is 20% of the population size K. The number in each estimate column is the average of the estimated accuracy over the different K's. The error range is 3·σ(α_(K)), with 99.7% confidence.

TABLE 3
                 Accuracy Based                   Cost-sensitive
                 True Val    Estimate             True Val       Estimate
For C4.5®
  Donation       94.94%      94.94% +/− 0%        $14,702.90     $14,913 +/− 612
  Credit Card    90.37%      90.08% +/− 1.5%      $804,964       $799,876 +/− 3,212
  Adult          85.6%       85.3% +/− 1.4%       $16,435        $16,255 +/− 142
For RIPPER®
  Donation       94.94%      94.94% +/− 0%        $0             $0 +/− 0
  Credit Card    91.46%      91.24% +/− 0.9%      $815,612       $820,012 +/− 3,742
  Adult          86.1%       85.9% +/− 1.3%       $19,875        $19,668 +/− 258
For NB
  Donation       94.94%      94.94% +/− 0%        $14,282        $14,382 +/− 120
  Credit Card    88.64%      89.01% +/− 1.2%      $798,943       $797,749 +/− 4,523
  Adult          84.94%      85.3% +/− 1.5%       $16,169        $16,234 +/− 134

To see how quickly the error range converges with increasing sample size, the entire process of sampling up to K=256 is plotted for all three datasets, as shown in FIGS. 7A, 7B, and 7C. The error range is 3·σ(α_(K)) for 99.7% confidence.

There are four curves in each plot. The one on the very top and the one on the very bottom are the upper and lower error bounds. The current benefits and estimated total benefits are within the upper and lower error bounds. Current benefits and estimated total benefits are very close, especially when k becomes big.

As shown clearly in all three plots, the error bound decreases exponentially. When k exceeds 50 (approximately 20% of 256), the error range is already within 5% of the total benefits of the complete model. If the accuracy of the current model is satisfactory, the learning process can be discontinued and the current model returned.

For the three datasets under study and different numbers of partitions K, when k>30%·K, the current model is usually within a 5% error range of the total benefits of the complete model. Moreover, for traditional accuracy, the current model is usually within a 1% error bound of the accuracy of the complete model (detailed results not shown).

Next, an experiment under extreme situations is discussed. When K becomes too large, each data partition becomes trivially small and will not be able to produce an effective model. If the estimation methods can effectively detect the inaccuracy of the complete model, the user can choose a smaller K.

All three datasets were partitioned into K=1024 partitions. For the adult dataset, each partition contains only 32 examples, but there are 15 attributes. The estimation results 800, 801, 802 are shown in FIGS. 8A, 8B, and 8C.

The first observation is that the total benefits for donation and adult are much lower than the baseline. This is obviously due to the trivial size of each data partition. The total benefits for the credit card dataset are $750,000, which is still higher than the baseline of $733,980.

The second observation is that after the sampling size k exceeds aroundas small as 25 (out of K=1024 or 0.5%), the error bound becomes smallenough. This implies that the total benefits by the complete model isvery unlikely (99.7% confidence) to increase. At this point, the usershould realistically cancel the learning for both donation and adultdatasets.

The reason for the “bumps” in the adult dataset plot is that each data partition is too small and most decision trees will predict N most of the time. At the beginning of the sampling, there are no variations, or all the trees make the same predictions. When more trees are introduced, some diversity starts to appear. However, the absolute value of the bumps is less than $50, as compared to $12,435.13.

Table 3 above shows the true accuracy and estimated accuracy. The sampling size is 20% of the population size K, where K ∈ {8, 16, 32, 64, 128, 256}. The number in estimated accuracy is the average of the estimated accuracy over the different K's. The error range is 3·σ(α_(K)) for 99.7% confidence.

Training Time Estimation

The remaining training time 900 using the sampled k base classifiers is also estimated. Only the results for credit card fraud detection with K=256 are shown in FIG. 9. The true remaining training time and its estimate are identical.

Training Efficiency

The training time of the batch-mode single model plus the time to classify the test data is recorded, as is the training time of the multiple model with k=30%·K classifiers plus the time to classify the test data k times. The ratio of the recorded times of the single and multiple models, called serial improvement, is then computed. This is the number of times that training the multiple model is faster than training the single model.

In FIGS. 10A, 10B, and 10C, the serial improvement 1000, 1001, 1002 is plotted for all three datasets, using C4.5® as the base learner. When K=256, using the multiple model not only provides higher accuracy, but the training is also 80 times faster for credit card and 25 times faster for both adult and donation.

Smoothing Effect

In FIGS. 11A, 11B, 11C, and 11D, decision plots (as defined above) 1100, 1101, 1102, 1103 are plotted for the credit card fraud dataset. K is chosen so that K=256 for the multiple model. The number on each plot shows the number of examples whose P(x)>T(x) (predicted as frauds). To show these numbers clearly on the plot, the data points surrounding the text area are not plotted.

The top two plots (FIGS. 11A and 11B) are fraudulent transactions and the bottom plots (FIGS. 11C and 11D) are non-fraudulent transactions. The overall effect of the averaging ensemble increases the number of true positives from 1150 to 1271 and the number of false positives from 1619 to 2192. However, the average transaction amount of the “extra number” of detected frauds by the ensemble (121=1271−1150) is around $2400, which greatly overcomes the cost of extra false alarms ($90 per false alarm).
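By way of illustration only, the trade-off described above can be worked through numerically. The sketch below uses the counts stated in the text, treats the approximate $2400 average transaction amount as exact, and applies the credit card benefit matrix (y(x) − $90 for a detected fraud, −$90 for a false alarm); the resulting figure is therefore only a rough estimate, and the function name is hypothetical.

```python
def net_gain_from_ensemble():
    """Approximate extra total benefit of the ensemble over the single model."""
    overhead = 90.0        # investigation cost per predicted fraud, in dollars
    avg_amount = 2400.0    # approximate average amount of the extra detected frauds
    extra_true_positives = 1271 - 1150    # 121 additional frauds detected
    extra_false_positives = 2192 - 1619   # 573 additional false alarms
    gain = extra_true_positives * (avg_amount - overhead)
    loss = extra_false_positives * overhead
    return extra_true_positives, extra_false_positives, gain - loss

extra_tp, extra_fp, net = net_gain_from_ensemble()
```

Under these assumptions, the 121 additional detected frauds contribute about $279,510, the 573 additional false alarms cost $51,570, and the net gain is about $227,940.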

Thus, as demonstrated above, for problems like credit card fraud, donation, and catalog mailing, where positive examples have varied profits and negative examples have low or fixed cost, the ensemble methods tend to beat the single model.

Exemplary Hardware Implementation

FIG. 12 illustrates a typical hardware configuration of an information handling/computer system 1200 in accordance with the invention and which preferably has at least one processor or central processing unit (CPU) 1211.

The CPUs 1211 are interconnected via a system bus 1212 to a random access memory (RAM) 1214, read-only memory (ROM) 1216, input/output (I/O) adapter 1218 (for connecting peripheral devices such as disk units 1221 and tape drives 1240 to the bus 1212), user interface adapter 1222 (for connecting a keyboard 1224, mouse 1226, speaker 1228, microphone 1232, and/or other user interface device to the bus 1212), a communication adapter 1234 for connecting an information handling system to a data processing network, the Internet, an Intranet, a personal area network (PAN), etc., and a display adapter 1236 for connecting the bus 1212 to a display device 1238 and/or printer 1239 (e.g., a digital printer or the like).

In addition to the hardware/software environment described above, a different aspect of the invention includes a computer-implemented method for performing the above method. As an example, this method may be implemented in the particular environment discussed above.

Such a method may be implemented, for example, by operating a computer, as embodied by a digital data processing apparatus, to execute a sequence of machine-readable instructions. These instructions may reside in various types of signal-bearing media.

Thus, this aspect of the present invention is directed to a programmed product, comprising signal-bearing media tangibly embodying a program of machine-readable instructions executable by a digital data processor incorporating the CPU 1211 and hardware above, to perform the method of the invention.

This signal-bearing media may include, for example, a RAM contained within the CPU 1211, as represented by the fast-access storage for example. Alternatively, the instructions may be contained in another signal-bearing media, such as a magnetic data storage diskette 1300 (FIG. 13), directly or indirectly accessible by the CPU 1211.

Whether contained in the diskette 1300, the computer/CPU 1211, or elsewhere, the instructions may be stored on a variety of machine-readable data storage media, such as DASD storage (e.g., a conventional “hard drive” or a RAID array), magnetic tape, electronic read-only memory (e.g., ROM, EPROM, or EEPROM), an optical storage device (e.g. CD-ROM, WORM, DVD, digital optical tape, etc.), paper “punch” cards, or other suitable signal-bearing media including transmission media such as digital and analog and communication links and wireless. In an illustrative embodiment of the invention, the machine-readable instructions may comprise software object code.

The Present Invention as an Apparatus with Software Modules

In another aspect of the present invention, it will be readily recognized that the exemplary information handling/computer system 1200 or the exemplary magnetic data storage diskette 1300 shown in FIGS. 12 and 13, respectively, as embodying the present invention in the form of software modules, might include a computer program 1400 having software modules exemplarily shown in FIG. 14.

Software module 1401 comprises a graphic user interface (GUI) to allow a user to enter parameters, control the progressive learning model development, and view results. Software module 1402 comprises a memory interface to allow data from the database to be retrieved for the model development and to store results as the modeling progresses.

Software module 1403 comprises a module that divides the database data into the N segments for the progressive modeling. Software module 1404 comprises a calculator for developing the base classifier for each segment. Finally, software module 1405 comprises a calculator for developing the ensemble model from the base classifiers.

The Present Invention as a Business Method/Service

In yet another aspect of the present invention and as one of ordinary skill in the art would readily recognize after having read this application, the technique discussed herein has commercial value as well as academic value.

That is, the present invention significantly increases both accuracy of the model and the throughput of prediction (e.g., by at least 1000% to 2000%). If the training time by a conventional ensemble takes one day, using the approach of the present invention, it would take about one hour. These benefits are significant, since they mean that using this approach, the same amount of hardware can process twice to ten times as much data. Such a significant increase in throughput will scale up many applications, such as homeland security, stock trading surveillance, fraud detection, aerial space images, and others, where the volume of data is very large.

Therefore, as implemented as a component in a service or business method, the present invention would improve accuracy and speed in any application that uses inductive learning models. This commercial aspect is intended as being fully encompassed by the present invention.

One of ordinary skill in the art, after having read the present application, would readily recognize that this commercial aspect could be implemented in a variety of ways. For example, a computing service organization or consulting service that uses inductive learning techniques as part of their service would benefit from the present invention. Indeed, any organization that potentially relies on results of modeling by inductive learning techniques, even if these results were provided by another, could benefit from the present invention.

It would also be readily recognized that the commercial implementation of the present invention could be achieved on a computer network, such as the Internet, and that various parties could be involved in an implementation such as on the Internet. Thus, for example, a service provider might make available to clients one or more inductive learning modeling programs that incorporate the present invention. Alternatively, a service provider might provide the service of executing the present invention on a database provided by a client.

All of these variations of commercial implementations of the present invention, and any others that one of ordinary skill in the art, after reading the present application, would recognize as within the scope of the present invention, are considered as being encompassed by this invention.

While the invention has been described in terms of exemplary embodiments, those skilled in the art will recognize that the invention can be practiced with modification within the spirit and scope of the appended claims.

Further, it is noted that Applicants' intent is to encompass equivalents of all claim elements, even if amended later during prosecution.

1. A method of processing an inductive learning model for a dataset of examples, said method comprising: dividing said dataset into a plurality of subsets of data; and developing an estimated learning model for said dataset by developing a learning model for a first subset of said plurality of subsets.
2. The method of claim 1, further comprising: progressively forming an ensemble model of said dataset by sequentially developing a learning model for each of a successive one of said plurality of subsets, until a desired indication of termination has been reached.
3. The method of claim 1, further comprising: developing at least one of a current accuracy and an estimated final accuracy, said current accuracy comprising an accuracy of said learning model for said first subset, said estimated final accuracy comprising an estimated accuracy of said estimated learning model for said dataset.
4. The method of claim 2, further comprising: developing at least one of a current accuracy and an estimated final accuracy, said current accuracy comprising an accuracy of said learning model for said subset being currently developed, said estimated final accuracy comprising an estimated accuracy of said ensemble model of said dataset.
5. The method of claim 2, further comprising: developing an estimated training time to complete development of said ensemble model.
6. The method of claim 3, wherein each said example in said dataset carries a benefit and said accuracy comprises an overall accuracy that reflects an estimated total amount of reward from said benefits.
7. The method of claim 6, wherein said benefit is not equal for all said examples, said learning comprising a cost-sensitive learning, and said accuracy comprises an overall accuracy that reflects an estimated total amount of reward from said benefits in units of money.
8. An apparatus for processing an inductive learning model for a dataset of examples, said apparatus comprising: a database divider for dividing said dataset into N subsets of data; and a base classifier calculator for developing a learning model for data in a first subset of said N subsets.
9. The apparatus of claim 8, further comprising: an ensemble calculator for progressively developing an ensemble model of said database of examples by successively integrating a base classifier from successive subsets of said N subsets.
10. The apparatus of claim 9, further comprising: a memory interface to retrieve data from said database and to store data as said inductive learning model is progressively developed; and a graphic user interface to allow a user to selectively enter parameters, to control the progressive development of said ensemble model, and to view results of said progressive development.
11. A system to process an inductive learning model for a dataset of example data, said system comprising one or more of: a memory containing one or more of a plurality of segments of said example data, wherein each said segment of example data comprises data for calculating a base classifier for an ensemble model of said dataset; a base classifier calculator for developing a learning model for data in one of said N segments; an ensemble calculator for progressively developing an ensemble model of said database of examples by successively integrating a base classifier from successive ones of said N segments; a memory interface to retrieve data from said database and to store data as said inductive learning model is progressively developed; and a graphic user interface to allow a user to at least one of enter parameters, to control the progressive development of said ensemble model, and at least one of display and printout results of said progressive development.
12. A method of providing a service, said method comprising at least one of: providing a database of example data to be used to process an inductive learning model for said example data, wherein said inductive learning model is derivable by dividing said example data into N segments and using at least one of said N segments of example data to derive a base classifier model; receiving said database of example data and executing said method of deriving said inductive learning model; providing an inductive learning model as derived; executing an application of an inductive learning model as derived; and receiving a result of said executing said application.
13. A method of deploying computing infrastructure, comprising integrating computer-readable code into a computing system, wherein the code in combination with the computing system is capable of processing an inductive learning model for a dataset of examples by: dividing said dataset into N subsets of data; and developing an estimated learning model for said dataset by developing a learning model for a first subset of said N subsets.
14. A signal-bearing medium tangibly embodying a program of machine-readable instructions executable by a digital processing apparatus to perform a method of processing an inductive learning model for a dataset of examples, said method comprising: dividing said dataset into N subsets of data; and developing an estimated learning model for said dataset by developing a learning model for a first subset of said N subsets.
15. The signal-bearing medium of claim 14, said method further comprising: progressively forming an ensemble model of said dataset by sequentially developing a learning model for each of a successive one of said N subsets, until a desired indication of termination has been reached.
16. The signal-bearing medium of claim 15, said method further comprising: developing at least one of a current accuracy and an estimated final accuracy, said current accuracy comprising an accuracy of said learning model for said subset being currently developed, said estimated final accuracy comprising an estimated accuracy of said ensemble model of said dataset.
17. The signal-bearing medium of claim 15, said method further comprising: developing an estimated training time to complete development of said ensemble model.
18. The signal-bearing medium of claim 16, wherein each said example in said dataset carries a benefit and said accuracy comprises an overall accuracy that reflects an estimated total amount of reward from said benefits.
19. The signal-bearing medium of claim 18, wherein said benefit is not equal for all said examples, said learning comprising a cost-sensitive learning, and said accuracy comprises an overall accuracy that reflects an estimated total amount of reward from said benefits in predetermined units.
20. A method of at least one of increasing a speed of development of a learning model for a dataset of examples and increasing an accuracy of said learning model, said method comprising: dividing said dataset into N subsets of data; and developing an estimated learning model for said dataset by developing a learning model for a first subset of said N subsets.
21. The method of claim 20, further comprising: calculating an estimated accuracy for said learning model.
22. The method of claim 20, further comprising: calculating a remaining training time.
23. The method of claim 20, further comprising: progressively, and stepwise, forming an ensemble model of said dataset by sequentially using additional said subsets to develop an additional learning model for said subset and incorporating each said additional learning model into an aggregate model to form said ensemble model, wherein said progressive and stepwise forming can be terminated prior to developing an additional learning model for all of said N subsets.
24. The method of claim 20, wherein said examples carry potentially different benefits, said method further comprising: calculating an estimation of an accumulated benefit for said learning model.
25. A method of developing a predictive model, said method comprising: for a dataset comprising a plurality of elements, each said element comprising a feature vector, said dataset further comprising a true class label for at least a portion of said plurality of elements, said true class labels allowing said dataset to be characterized as having a plurality of classes, dividing at least a part of said portion of said plurality of elements having said true class label into N segments of elements; and learning a model for elements in at least one of said N segments, as an estimate for a model for all of said dataset.
26. The method of claim 25, further comprising: using a second part of said portion of said plurality of elements having said true class label as a validation set for said model.
27. The method of claim 26, further comprising: using said validation set to calculate a predicted accuracy for said model.
28. The method of claim 25, further comprising: calculating an estimated training time for learning a model based on a remainder of said N segments.
29. The method of claim 25, wherein said elements are each associated with a benefit, said method further comprising: establishing a benefit matrix associated with said plurality of classes, said benefit matrix defining a benefit for each said element in said dataset as applicable for each said class.
30. The method of claim 29, wherein said elements in said dataset can respectively have different benefit values, said method further comprising: using a validation dataset to measure a validation of said model; and calculating an aggregate benefit for said model, as based on said validation dataset.
31. The method of claim 25, further comprising: progressively developing an ensemble model by successively learning a model for elements in one of a remaining said N segments, wherein said progressively developing said ensemble model is terminable at any stage.
32. The method of claim 31, further comprising: calculating at least one of an accuracy and a remaining training time for said ensemble model.
33. The method of claim 32, further comprising: entering a threshold for at least one of said accuracy and said remaining training time; and automatically terminating said progressively developing said ensemble model whenever said threshold is exceeded.
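
By way of non-limiting illustration only, the progressive formation of an ensemble model with threshold-based termination, as recited in the foregoing claims, might be sketched as follows. The sketch is not the Applicants' implementation. Every function name in it is hypothetical, and the round-robin division, the one-dimensional nearest-class-mean base classifier, the majority-vote combination, and the plain accuracy measure on a validation set are all simplifying assumptions chosen to keep the example short.

```python
from statistics import mean


def split_into_subsets(examples, n):
    """Divide the dataset into n disjoint subsets (round-robin; an assumption)."""
    return [examples[i::n] for i in range(n)]


def train_base(subset):
    """Hypothetical base learner: nearest class mean on a scalar feature."""
    labels = sorted({y for _, y in subset})
    means = {label: mean(x for x, y in subset if y == label) for label in labels}
    return lambda x: min(labels, key=lambda label: abs(x - means[label]))


def ensemble_predict(models, x):
    """Combine the base classifiers by majority vote (an assumption)."""
    votes = [model(x) for model in models]
    return max(sorted(set(votes)), key=votes.count)


def accuracy(models, validation):
    """Fraction of validation examples the current ensemble labels correctly."""
    hits = sum(ensemble_predict(models, x) == y for x, y in validation)
    return hits / len(validation)


def progressive_ensemble(examples, validation, n, target_accuracy):
    """Train one base classifier per subset; stop once the target is reached."""
    models, history = [], []
    for subset in split_into_subsets(examples, n):
        models.append(train_base(subset))
        history.append(accuracy(models, validation))
        if history[-1] >= target_accuracy:
            break  # terminate before the remaining subsets are processed
    return models, history


if __name__ == "__main__":
    # Synthetic, perfectly separable data invented for this illustration.
    data = [((i % 7) * 0.1 + (5.0 if i % 2 else 0.0),
             "donor" if i % 2 else "non-donor") for i in range(100)]
    held_out = [(0.3, "non-donor"), (5.2, "donor"), (0.1, "non-donor"), (5.5, "donor")]
    models, history = progressive_ensemble(data, held_out, n=5, target_accuracy=0.95)
    print(len(models), history)
```

In a cost-sensitive setting, the accuracy function would be replaced by a total-benefit measure computed from a benefit matrix, and the stopping test could equally be applied to an estimated remaining training time.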